Video saliency detection plays an important role in efficient and precise video
compression. Image saliency has been studied extensively, whereas video saliency
has received far less attention. This paper proposes a modified high
efficiency video coding (HEVC) algorithm with background modelling and
classification of coding blocks. The solution first employs the G-picture
in the fourth frame as a long-term reference and quantizes it using an
algorithm that segregates frames by the background features of the image.
Coding blocks are then classified to decrease the complexity of HEVC coding,
reduce time consumption and speed up the overall saliency process. The solution
is evaluated on the dynamic human fixation 1K (DHF1K) dataset and compared with
several other state-of-the-art saliency methods to showcase the reliability and
efficiency of the proposed solution.
1. INTRODUCTION

The human eye is a complex organ; the way it works with the brain to filter and analyse the necessary
components of the image it sees has perplexed scientists for ages, and many have tried to replicate it using
algorithms and computations. Modelling the process carried out in the brain, researchers have tried to develop
methods that pick out the areas of interest in a given image, just like the human visual system. With the inclusion
of deep learning technology, there has been a significant rise in image saliency detection methods with
remarkable accuracy when tested on large-scale static gaze datasets such as the SALICON dataset [1]. However,
although image saliency detection has been studied widely, it remains quite challenging
to achieve the same quality of dynamic fixation prediction for moving images or videos. Video saliency plays an
important role in video compression, captioning, object segmentation and so on. Saliency models fall into
two classes, namely salient object detection and human eye fixation prediction. They are
also of two types according to their input, static and dynamic. Static models, as the name suggests, take images as
their input and, likewise, dynamic models take video input.

The inspiration for this paper stems from several research papers in related fields. The works in [2], [3]
have had a significant impact on saliency research; they use the difference in features between surrounding and
central patches to estimate visual saliency. Other work detects saliency by representing each block as a
combination of all its neighboring blocks, based on random walk models [4]-[6]. Graph-based visual saliency is a
distinctive technique built on graphs: it forms activation maps on certain features and then normalizes them, and
it reports a receiver operating characteristic (ROC) value of about 98%. A similar approach uses random walk
models on a graph to imitate eye movements [7]. It first extracts intensity, colour and compactness features and
constructs a fully connected graph; the algorithm then computes the stationary distribution of the Markov chain
on the graph as the saliency map. Another method detects saliency using spectral features of an image; it exploits
image features such as luminance and colour, which reduces computational complexity while giving
accurate results [8]-[10]. Region-based methods have also been used for saliency detection. Here the input image
is first segmented into regions and a saliency level is assigned to each region using global contrast features
with spatially weighted coherence, while [11]-[15] use robust background measures together with a principled
optimization framework that integrates all low-level maps into a final clean and uniform saliency map. All the
above algorithms address still images, and they serve as a basis for video saliency detection algorithms.
Adapting them to detect visual saliency in video accurately requires motion information that imitates the human
eye's perception of movement. One approach builds a quaternion representation from image features such as colour,
intensity and motion and employs the phase spectrum of the quaternion Fourier transform. Another methodology
follows the discriminant center-surround hypothesis: it combines colour, orientation, and spatial and temporal
saliency by summing the absolute differences between the temporal gradients of central and surrounding
regions [16]. A further method extracts features from partially decoded data using global and local
spatiotemporal (GLST) features [17]: the compressed video bitstream is partially decoded to obtain discrete
cosine transform (DCT) coefficients and motion vectors, from which the GLST features are extracted; spatial and
temporal maps are then generated and fused to produce the result. The method in [18] uses random walk with
restart: it derives the temporal saliency distribution from motion distinctiveness, abrupt change and temporal
consistency, uses it as the restarting distribution, and takes the steady-state distribution as the
spatiotemporal saliency.
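
The random-walk-with-restart idea described above can be illustrated with a minimal sketch. It is not the implementation of [18]: the toy graph, the restart probability `alpha` and the function name are assumptions made only for illustration. The restart vector plays the role of the temporal saliency distribution, and the steady-state vector plays the role of the spatiotemporal saliency.

```python
def random_walk_with_restart(transition, restart, alpha=0.2, iters=200):
    """Iterate pi_j <- (1 - alpha) * sum_i pi_i * P[i][j] + alpha * restart_j.

    transition[i][j] is the probability of moving from node i to node j
    (each row sums to 1); restart is a probability vector.
    alpha (restart probability) is a hypothetical value, not from the paper.
    """
    n = len(restart)
    pi = list(restart)
    for _ in range(iters):
        pi = [
            (1 - alpha) * sum(pi[i] * transition[i][j] for i in range(n))
            + alpha * restart[j]
            for j in range(n)
        ]
    return pi


# Toy 3-node graph (assumed); node 2 has the strongest temporal saliency.
P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
temporal = [0.1, 0.2, 0.7]
saliency = random_walk_with_restart(P, temporal)
```

In this toy case the steady-state values remain a probability distribution and keep the ordering of the restart distribution, because the graph itself is symmetric.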

All these studies and experiments show that many state-of-the-art methods are available in the
uncompressed domain. Since videos and images are generally transmitted in a compressed format, these conventional
algorithms do not perform well in such situations. The only way for them to work effectively on the available
data is to decode it fully, but this increases time consumption and the complexity of the code. Some research
has addressed this problem. The work in [19] improves the DCT-domain transcoder (DDT) by proposing a fast
extraction method for partial low-frequency coefficients in the DCT-domain motion compensation operation
(DCT-MC). The method of Zhang et al. [20] is designed to exploit low-level compressed-domain features from the
bitstream. The method in [21] uses object recognition for fast saliency detection, with colour clustering and
region merging based on spatiotemporal similarities, pixel edge extraction and regional classification. Similar
video saliency detection methods are presented in [22]-[25]. Several methods for improving saliency detection
have also been reported. One of them introduced the G-picture methodology, in which a long-term reference is
maintained, typically as the second reference frame, to reduce the complexity of the
high efficiency video coding (HEVC) algorithm; a dedicated quantization parameter is used for
quantizing the G-picture (ground), and background reference prediction (BRP) and
background difference prediction (BDP) are employed. Small coding blocks called coding units were also introduced
to lower complexity and increase compression efficiency [26]-[30]. However, these works were carried out at
different times and in different areas. We incorporate all these modifications into a
solution that not only reduces complexity but also allows flexible input sizes, with reduced time consumption
and better compression precision, accuracy and efficiency [31]-[33].

In this study, a modified version of the HEVC method is proposed that makes use of background
modeling with a hierarchical prediction structure (HPS). It consists of two parts. The first is the modification
of the reference frame used: the G-picture becomes the fourth reference frame rather than the second, as stated in
other research papers, it is quantized with a relatively small parameter, and coding blocks are added to
reduce complexity. The second is the classification of each coding block as Fg, Bg or Hg. Depending on
the information contained in the G-picture, each of these classes is sped up in its own way. To avoid further
coding and calculation, coding block partitioning is additionally terminated early.

This paper has five sections. The first section is the introduction, and the
second surveys the related work. The third section covers the mathematical and coding
components of the proposed system, and the fourth presents the results of the tests on the
dynamic human fixation 1K (DHF1K) dataset. The fifth section concludes the paper.


2. LITERATURE SURVEY 

This section gives a quick overview of the studies and experiments that have aided the
development of our solution. We begin with [34]-[37], which present an improved variant of the HEVC method in
which perceptual redundancy is reduced to achieve higher compression. With the use of a


Int J Reconfigurable & Embedded Syst, Vol. 13, No. 2, July 2024: 431-440 


Int J Reconfigurable & Embedded Syst ISSN: 2089-4864


convolutional neural network, this technique combines the motion estimation results from each block
during the compression phase and employs adaptive dynamic fusion for the saliency map. The fundamental
element of the algorithm is its spatiotemporal component. The next work is a survey
that helped us group and select the appropriate database as well as the modification approach for
our solution. It provides an up-to-date overview of video compression research along with
its milestones [38]-[40], covering conventional codec adaptation as well as learning-based end-to-end approaches,
with their advantages and disadvantages. It concludes that computational complexity is an issue that needs
to be solved as early as possible. Another survey reviews the available saliency models
and the drawbacks that have led to insufficient accuracy and precision in compression [41].
Its author provides insight into the different ways in which saliency models try to mimic the
actual process of the human eye and brain.

This has helped in making the right modifications to our algorithm on the basis of practical comparisons.
The main dataset used for the experiments on the proposed solution is also the dataset used by the base
reference against which our results are compared [27], [42]. The dynamic human fixation 1K dataset, known
as DHF1K, supports fixation prediction when viewing dynamic scenes. It contains 1000 high-definition, diverse
video clips viewed by 17 observers wearing eye trackers. The same work also proposes the attentive convolutional
neural network-long short-term memory (LSTM) network (ACLNet), a cutting-edge video saliency approach.
It additionally compares its findings with those of other techniques on further datasets, including
Hollywood-2 and University of Central Florida (UCF) sports. It was one of the fastest approaches at the
time. The work in [43] provides knowledge on hyper saliency: convolutional neural networks are trained using
manual algorithmic annotations of smooth pursuits, and the findings are developed with the aid of 26 dynamic
saliency models that are freely available online. Another study has also aided our algorithm development:
for prediction in dynamic scenarios, it devises a new 3-dimensional (3D) convolutional
encoder-decoder architecture [41]. The encoder has two subnetworks that separate the spatial and temporal
components of each frame and then fuse them. The decoder then aggregates temporal data and enlarges the
features in the spatial dimensions. The network is trained end to end and tested on the DHF1K dataset. A further
survey covers the deep-learning-based video saliency methods available today and their attempts
to reach the human level of eye movement tracking and feature detection [44].

The work in [45] provides a no-reference bitstream human vision system (NRHVS) based video quality
assessment (VQA) method. Saliency maps are generated by extracting features from the HEVC bitstream,
and a visual memory model is then built from saliency map statistics. A support vector regression pipeline
learns the approximate video quality. DeepVS2.0 is a video saliency prediction
approach based on deep neural networks [39]. It has aided in comparing our outcomes and evaluating how we
perform against other cutting-edge techniques. To create the intra-frame saliency map, it presents an
object-to-motion convolutional neural network (OM-CNN) that learns spatiotemporal features. A convolutional
LSTM network then uses the features extracted by OM-CNN to model inter-frame saliency.
Our baseline reference is [46]. It applies its spatiotemporal self-attention (STSANet) model at different
levels of the 3D convolutional backbone for video saliency mapping [47]-[50]. To integrate the
levels with context in semantic and spatiotemporal subspaces, attentional multi-scale fusion (AMSF) is used.


3. PROPOSED SYSTEM 
3.1. Optimizing low-delay hierarchical prediction structure efficiently 

In this part, we briefly discuss the constituents of the low-delay HPS of the HEVC test model.
It has two components. One is the hierarchical reference (HR), which uses the data of the last
frame and of other prioritized frames from the last three short groups of frames. The other is the hierarchical
quantization (HQ), in which the quantization parameter of each important frame is two less than that of the
image that follows it, while the quantization parameter of the middle image in the short group of frames is one
more than the important frame's value. To optimize this structure, we replace the fourth reference frame with the
G-picture (generated using a general, low-complexity running algorithm), which remains as a long-term
reference. For this, we use Lagrangian rate-distortion optimization to evaluate the rate-distortion (RD) cost
$C$ in (1), where $Q$ denotes the quality of the reconstructed video relative to the original, $r$ denotes the
number of bits and $\mu$ denotes the Lagrange multiplier. Let there be $m$ input frames, and let $\Psi(I_i, p)$
represent the rate-distortion cost of encoding the $i$-th picture $I_i$, where $p$ is the quantization parameter
of the coding units. The total cost is then given by (2).


$$C = Q + \mu r \tag{1}$$

$$C = \sum_{i=1}^{m} \Psi(I_i, p) = \sum_{i=1}^{m} \sum_{r} \Psi(p, I_{i,r}, Q_{i,r,p}, U_{i,r,p}) \tag{2}$$

where $U_{i,r,p}$ represents the motion vectors and $Q_{i,r,p}$ is the prediction data quantized with $p$. With a
smaller $p'$, picture $I_j$ provides a better reference for the following $a$ images ($I_{j+1} \sim I_{j+a}$).
Assume that $n_{j+1}$ coding blocks of $I_{j+1}$, with indexes $e(j+1,1) \sim e(j+1,n_{j+1})$, obtain a better
reference from $I_j$, while the other $q_{j+1}$ coding blocks, with indexes $t(j+1,1) \sim t(j+1,q_{j+1})$,
cannot do so. The same holds for $I_{j+2} \sim I_{j+a}$: in $I_{j+s}$, the $n_{j+s}$ coding blocks indexed by
$e(j+s,1) \sim e(j+s,n_{j+s})$ obtain a better prediction than the $q_{j+s}$ coding blocks indexed by
$t(j+s,1) \sim t(j+s,q_{j+s})$. The new cost equation is then as shown in (3).

$$C' = T_1 + T_2 + T_3 + T_4 + T_5 \tag{3}$$
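As can be seen from (4), $T_1$ is the rate-distortion cost of the pictures before $I_j$, $T_2$ is the
rate-distortion cost of $I_j$ after using $p'$ for encoding, $T_3$ is the cost of the coding blocks of
$I_{j+1} \sim I_{j+a}$ that do not obtain a better reference, $T_4$ is the combined rate-distortion cost of the
coding blocks that are predicted from the modified $I_j$, and $T_5$ is the cost of the pictures from
$I_{j+a+1}$ onwards.

$$T_1 = \sum_{i=1}^{j-1} \Psi(I_i, p),$$

$$T_2 = \Psi(I_j, p'),$$

$$T_3 = \sum_{i=j+1}^{j+a} \sum_{l=1}^{q_i} \Psi(p, I_{t(i,l)}, Q_{t(i,l),p}, U_{t(i,l),p}),$$

$$T_4 = \sum_{i=j+1}^{j+a} \sum_{l=1}^{n_i} \Psi(p, I_{e(i,l)}, Q_{e(i,l),p'}, U_{e(i,l),p'}),$$

$$T_5 = \sum_{i=j+a+1}^{m} \Psi(I_i, p). \tag{4}$$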


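The cost obtained when $p$ is used for $I_j$ instead is shown in (5), with $X$ and $Y$ defined in (6).
Subtracting (3) from (5), we get (7).

$$C = T_1 + X + T_3 + Y + T_5 \tag{5}$$

$$X = \Psi(I_j, p),$$

$$Y = \sum_{i=j+1}^{j+a} \sum_{l=1}^{n_i} \Psi(p, I_{e(i,l)}, Q_{e(i,l),p}, U_{e(i,l),p}). \tag{6}$$

$$C - C' = (Y - T_4) - (T_2 - X) \tag{7}$$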


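In $T_4$, the term $Q_{e(i,l),p'}$ has a smaller quantization loss than the term $Q_{e(i,l),p}$
($i = j+1 \sim j+a$, $l = 1 \sim n_i$) in $Y$; because of this, the inequality shown in (8) is satisfied.

$$Y - T_4 = \sum_{i=j+1}^{j+a} \sum_{l=1}^{n_i} \left( \Psi(p, I_{e(i,l)}, Q_{e(i,l),p}, U_{e(i,l),p}) - \Psi(p, I_{e(i,l)}, Q_{e(i,l),p'}, U_{e(i,l),p'}) \right) > 0 \tag{8}$$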


Thus, the conclusion can be stated as: "for a large $a$ in the rate-distortion cost $C'$ that satisfies
$Y - T_4 > T_2 - X$, we have $C - C' > 0$." Several conclusions can be drawn from the analysis of the equations
above. If an image is frequently chosen as a reference for the next batch of pictures, then its quantization
parameter must be selected to be relatively small. Extending this
conclusion, the G-picture, as it is taken as a long-term reference, must be quantized with a value
smaller than the quantization parameter of the ordinary frames.
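
The comparison between (3) and (5) can be checked numerically with a small sketch. All distortion and bit values below, the multiplier `MU` and the function names are hypothetical numbers chosen only to illustrate the inequality in (7); they are not measurements from the paper.

```python
def rd_cost(distortion, bits, mu):
    """Lagrangian rate-distortion cost C = Q + mu * r, as in (1)."""
    return distortion + mu * bits


def total_costs(t1, t2, t3, t4, t5, x, y):
    """Return (C, C') following (5) and (3)."""
    c = t1 + x + t3 + y + t5
    c_prime = t1 + t2 + t3 + t4 + t5
    return c, c_prime


MU = 0.5  # hypothetical Lagrange multiplier

# Hypothetical (distortion, bits) pairs.
x = rd_cost(40.0, 100, MU)    # I_j coded with p
t2 = rd_cost(25.0, 160, MU)   # I_j coded with smaller p': less distortion, more bits
y = rd_cost(300.0, 400, MU)   # benefiting blocks, reference coded with p
t4 = rd_cost(240.0, 380, MU)  # same blocks, reference coded with p'

c, c_prime = total_costs(500.0, t2, 200.0, t4, 500.0, x, y)
gain = c - c_prime            # equals (Y - T4) - (T2 - X), as in (7)
```

With these assumed numbers the extra bits spent on the reference picture (`t2 - x`) are smaller than the saving on the pictures that follow (`y - t4`), so `gain` is positive.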

For the above conclusion to work well, a large number of pixels must
have the same features as the G-picture. Groups of pictures for which this holds are collectively put into
similar-background batches. The other groups of images are put under the general-background-batch, and
they work without the G-picture, as it does not hold any significant advantage for them. A similar-background
batch needs to be reworked for better bit encoding, quality preservation and compression. For this, two
quantization methodologies are possible: one is to use the same value as the one used for the
general-background-batch ($p$), and the other is to use another value for the quantization parameter ($p'$),
which differs from the first and gives better rate-distortion values. The valuable frames of the
general-background-batch are denoted by Gg, and the analysis of (7) applies to them analogously. This means that,
using the G-picture as a long-term reference, we can quantize any batch of frames with a larger quantization
parameter than the one used for the adjacent frame in the batch of frames.


3.2. Speeding up the algorithm 

The foreground units contain the basic coding blocks, which are 4 x 4 units in size, and each input in
the coding blocks is classified based on the number of basic blocks present in the foreground. Let $K(f)$ be the
type of basic coding block $f$, $g_{i,j}$ a pixel value of basic unit $f$ and $G_{i,j}$ the corresponding pixel
value of the G-picture; then $K(f)$ is as shown in (9).

$$K(f) = \begin{cases} Y, & \sum_{i=1}^{4} \sum_{j=1}^{4} \left| g_{i,j} - G_{i,j} \right| < x \\ H, & \text{otherwise} \end{cases} \tag{9}$$


Here, $x$ is a predefined threshold valued at 80. Then, taking the basic blocks in the group of coding
blocks ($o$ is used for its representation), the class of each coding block is calculated with the
help of the proportion of basic blocks labelled $H$, giving foreground blocks (Fg), background blocks (Bg) and
hybrid blocks (Hg). The size taken here is 2N x 2N, and the class is computed through (10).

$$\mathrm{Class}(o) = \begin{cases} Fg, & \text{if } 4 \times \|\{ i \mid K(o(i)) = H \}\| / N^2 > \alpha \\ Bg, & \text{if } 4 \times \|\{ i \mid K(o(i)) = H \}\| / N^2 < \beta \\ Hg, & \text{if } \alpha \ge 4 \times \|\{ i \mid K(o(i)) = H \}\| / N^2 \ge \beta \end{cases} \tag{10}$$

$$\alpha = 0.5; \quad \beta = 0.0625$$
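
The classification in (9) and (10) can be sketched as follows. This is a minimal illustration, assuming frames are plain lists of pixel rows and that a block is addressed by its top-left corner; the function names and the toy frames are hypothetical. A 2N x 2N block holds N^2/4 basic 4 x 4 units, so the quantity compared with the thresholds is simply the fraction of units labelled H.

```python
THRESHOLD = 80   # x in (9)
ALPHA = 0.5      # alpha in (10)
BETA = 0.0625    # beta in (10)


def is_h_unit(frame, g_picture, top, left, threshold=THRESHOLD):
    """(9): a 4x4 basic unit is labelled H when its summed absolute
    difference from the G-picture is not below the threshold."""
    sad = sum(
        abs(frame[top + i][left + j] - g_picture[top + i][left + j])
        for i in range(4) for j in range(4)
    )
    return sad >= threshold


def classify_block(frame, g_picture, top, left, n):
    """(10): classify the 2N x 2N coding block whose corner is (top, left)."""
    size = 2 * n
    h_count = sum(
        is_h_unit(frame, g_picture, top + r, left + c)
        for r in range(0, size, 4) for c in range(0, size, 4)
    )
    ratio = 4 * h_count / (n * n)  # fraction of basic units labelled H
    if ratio > ALPHA:
        return "Fg"
    if ratio < BETA:
        return "Bg"
    return "Hg"


# Toy 16x16 frames (N = 8) against an all-zero G-picture.
g = [[0] * 16 for _ in range(16)]
flat = [[0] * 16 for _ in range(16)]
patch = [[50 if r < 8 and c < 8 else 0 for c in range(16)] for r in range(16)]
full = [[50] * 16 for _ in range(16)]
labels = [classify_block(f, g, 0, 0, 8) for f in (flat, patch, full)]
```

In the toy example, a frame identical to the G-picture is classified as Bg, a frame that differs in one quarter of its units as Hg, and a frame that differs everywhere as Fg.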


In the traditional HEVC encoder, each coding block is encoded either as a single 2N x 2N coding block or
as four recursively coded parts, whichever has the lower rate-distortion cost. To avoid calculating and
comparing these rate-distortion costs, and thereby reduce time consumption, partition termination methods are
needed in the HEVC test model. They rely on the background remaining static for a long time. Each input is
considered as a potential coding block and is assigned to its class as in (10). Bg blocks with N > 8 occupy a
larger proportion than the other two classes and therefore need early termination. So, whenever a 16 x 16
(N = 8) coding block is classified as Bg, it is treated as a pure background block and does not undergo further
partition. There is also an issue regarding prediction partitions for coding blocks. For better accuracy, only
the 2N x 2N partition is used for Bg blocks with N > 8. The rest use it for N ≥ 8, and Hg blocks use no asymmetric
motion partitions. In addition, the motion search range is set to 1 pixel for Bg blocks and left unchanged for Hg
and Fg blocks.
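
The speed-up rules of this subsection can be summarized in a small decision function. This is only a sketch of one reading of the text: the dictionary keys, the mode names and the choice of `n >= 8` as the early-termination condition (so that the 16 x 16 case mentioned above is covered) are assumptions, not HEVC test model parameters.

```python
def partition_hints(block_class, n):
    """Return hypothetical encoder hints for a 2N x 2N coding block.

    search_range = None means the encoder default is left unchanged.
    """
    hints = {"allow_split": True, "modes": "all", "search_range": None}
    if block_class == "Bg":
        hints["search_range"] = 1          # 1-pixel motion search for background
        if n >= 8:                         # assumed threshold, covers 16x16 blocks
            hints["allow_split"] = False   # early termination of partitioning
            hints["modes"] = "2Nx2N"       # only the unsplit prediction partition
    elif block_class == "Hg":
        hints["modes"] = "no_amp"          # no asymmetric motion partitions
    return hints


bg_hints = partition_hints("Bg", 8)
hg_hints = partition_hints("Hg", 16)
fg_hints = partition_hints("Fg", 8)
```

Keeping the rules in one place like this makes it easy to see that foreground blocks are the only class left entirely on the default encoder path.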


3.3. Modelling the background and selection 

We calculate the average of all background frames in a running fashion. $J$ denotes the current
training frame and $M$ is the matrix of unsigned *-bit integers that holds the running average. The updated
average $M'$ is then given by (11).


$$M' = (M \times (m - 1) + J + (m \gg 1)) / m \tag{11}$$
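
The running update in (11) can be sketched with integer arithmetic as below. The frame representation (lists of integer rows) and the function name are assumptions for illustration; the term `m >> 1` acts as a rounding offset before the integer division.

```python
def update_background(mean, frame, m):
    """One step of (11): M' = (M * (m - 1) + J + (m >> 1)) // m, per pixel."""
    half = m >> 1  # rounding offset
    return [
        [(mv * (m - 1) + jv + half) // m for mv, jv in zip(mrow, jrow)]
        for mrow, jrow in zip(mean, frame)
    ]


# Three hypothetical 1x2 training frames.
frames = [[[10, 20]], [[20, 20]], [[30, 23]]]
mean = [[0, 0]]
for m, frame in enumerate(frames, start=1):
    mean = update_background(mean, frame, m)
```

After the three toy frames the running result equals the true per-pixel averages (20 and 21), without storing any earlier frame.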


In (11), $m$ is the number of training frames. Only one multiplication, one shift, one floor division and
three additional operations are performed in each update. The first image, if it is large enough, is detected by
the algorithm and can be treated as a large group of frames to minimize the time delay in the coding stage. Assume
that the HPS of this batch of frames has an even size $L$, and let $R(X, Y) = 1$ or $0$ indicate the proportion of
data that $X$ and $Y$ have in common, as defined in (13). Then the type $O(J_m)$ of any input picture with index
$m$, where $m = l \times L + i$ ($l \ge 0$, $i = 0 \sim L - 1$) and $J_{l \times L}$ is the initial image of its
batch, is determined by (12):


$$O(J_m) = \begin{cases} \text{general-background-patch}, & R(J_{l \times L}, Gg) = 1 \\ \text{similar-background-patch}, & R(J_{l \times L}, Gg) = 0 \end{cases} \tag{12}$$

For $R(X, Y)$, a 1-pixel range is taken to search in $Y$ for the basic units $A$. This is given by (13).
Algorithm 1 summarizes the background modelling procedure.

$$R(X, Y) = \begin{cases} 1, & \text{if } 16 \times \|A(X, Y)\| / (w \times h) > 0.8 \\ 0, & \text{otherwise} \end{cases} \tag{13}$$

$$A(X, Y) = \left\{ (q, p) \;\middle|\; \sum_{t,s=1}^{4} \left| X_{4q+t,\,4p+s} - Y_{4q+t,\,4p+s} \right| < 80,\; p < w/4,\; q < h/4 \right\}$$


Algorithm 1. Algorithm for background modelling
Input: Frame J_m of size h × w, where m = l × L + i
Output: O(J_m), the frame type: general-background-patch or similar-background-patch
if i ≠ 0 then O(J_m) = O(J_{m−i}); return
X = J_m; Y = Gg; A(X, Y) = ∅
for q = 1 to h/4 do
    for p = 1 to w/4 do
        if Σ_{t,s=1..4} |X_{4q+t,4p+s} − Y_{4q+t,4p+s}| < 80 then
            A(X, Y) = A(X, Y) ∪ {(q, p)}
    end
end
if 16 × ||A(X, Y)|| / (w × h) > 0.8 then
    R(X, Y) = 1
else
    R(X, Y) = 0
if R(X, Y) == 1 then O(J_m) = general-background-patch
else O(J_m) = similar-background-patch
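
Algorithm 1 can be rendered as a short runnable sketch. It assumes zero-based indexing, frames stored as lists of pixel rows, and a hypothetical `previous_type` argument that carries the type of the first frame of the batch; the mapping from R(X, Y) to the two labels is taken exactly as printed in (12) and Algorithm 1.

```python
def similar_units(x, y, threshold=80):
    """A(X, Y): 4x4 units whose summed absolute difference is below the threshold."""
    h, w = len(x), len(x[0])
    units = set()
    for q in range(h // 4):
        for p in range(w // 4):
            sad = sum(
                abs(x[4 * q + t][4 * p + s] - y[4 * q + t][4 * p + s])
                for t in range(4) for s in range(4)
            )
            if sad < threshold:
                units.add((q, p))
    return units


def r_indicator(x, y):
    """(13): 1 when more than 80% of the 4x4 units belong to A(X, Y)."""
    h, w = len(x), len(x[0])
    return 1 if 16 * len(similar_units(x, y)) / (w * h) > 0.8 else 0


def frame_type(frame, g_picture, i, previous_type=None):
    """Frame type O(J_m); frames with i != 0 inherit the type of their batch."""
    if i != 0 and previous_type is not None:
        return previous_type
    if r_indicator(frame, g_picture) == 1:
        return "general-background-patch"   # label mapping as printed in (12)
    return "similar-background-patch"


# Toy 8x8 frames against an all-zero G-picture.
g = [[0] * 8 for _ in range(8)]
same = [[0] * 8 for _ in range(8)]
diff = [[100] * 8 for _ in range(8)]
types = [frame_type(same, g, 0), frame_type(diff, g, 0)]
```

Because only the first frame of each batch is compared with the G-picture, the per-frame cost of the remaining frames is a single lookup.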


In addition, if we take the starting intra image in the low-delay hierarchical prediction algorithm and
quantize each similar-background-patch frame relative to it, then the quantization value is as shown in (14).

$$P_O(J_{l \times L + i}) = \begin{cases} P_g + 1, & \text{if } i = L - 1 \\ P_g + 2, & \text{if } i = L/2 \\ P_g + 3, & \text{if } i \ne L/2 \text{ or } L - 1 \end{cases} \tag{14}$$

To calculate the quantization parameters for each general-background-patch frame,
we can follow (15).

$$P_O(J_{l \times L + i}) = \begin{cases} P_g + 2, & \text{if } i = L - 1 \\ P_g + 4, & \text{if } i \ne L - 1 \end{cases} \tag{15}$$
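
The two assignment rules in (14) and (15) can be written as small functions. This is a sketch under the assumption that `p_g` is the base quantization value appearing in both equations and that `L` is even; the function names and the example values are hypothetical.

```python
def qp_similar(p_g, i, L):
    """(14): quantization parameter of a similar-background-patch frame."""
    if i == L - 1:
        return p_g + 1
    if i == L // 2:
        return p_g + 2
    return p_g + 3


def qp_general(p_g, i, L):
    """(15): quantization parameter of a general-background-patch frame."""
    return p_g + 2 if i == L - 1 else p_g + 4


# Hypothetical batch of L = 4 frames with base value 30.
similar_qps = [qp_similar(30, i, 4) for i in range(4)]
general_qps = [qp_general(30, i, 4) for i in range(4)]
```

For every position in the batch the similar-background rule gives a value below the general-background rule, reflecting the finer quantization that the shared background makes affordable.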

Next, the G-picture must be quantized at a smaller value than the surrounding frames, as
shown in (16).

$$\Delta P_g = \begin{cases} \dots, & \text{if } \dots \\ 10, & \text{if } \dots \end{cases} \tag{16}$$


4. EXPERIMENTS AND RESULTS 

The entire work has been compared with Wang et al. [26] and with various other saliency detection
methods. To maintain uniformity in the comparison, we have used the same dataset as
Wang et al. [26]. This helps in evaluating our performance and accuracy against other
state-of-the-art methods. The dataset, referred to as DHF1K [27], has around 1,000 videos with a frame rate of
30 and a resolution of 640 x 360. It is split into 600 training videos, 300 testing videos and 100 validation
videos. The data from 17 observers is collected using an eye tracker.


4.1. Evaluation metrics employed 

The selection of the evaluation metrics for the experiment was guided by Bylinskii et al. [28]. The metrics
selected are area under the ROC curve (AUC), shuffled AUC (sAUC), Pearson's correlation coefficient (CC),
normalized scanpath saliency (NSS), and similarity or histogram intersection (SIM). These measurements
have been useful for both self-evaluation and comparison with other cutting-edge techniques for determining
video saliency. We have compared our proposed model with the following existing models: temporal-spatial feature
pyramid network (TSFP-Net) [29], hierarchical decoding for dynamic saliency prediction (HD2S) [30], visual
features based convolutional encoder-decoder (ViNet) [31], the deep learning based video saliency prediction
approach (DeepVS) [32], Chen et al. [33], the attentive convolutional neural network-LSTM
network (ACLNet) [27], spatio-temporal self-attention 3D network (STRA-Net) [34],
temporally-aggregating spatial encoder-decoder network (TASED-Net) [35], saliency prediction model with
shuffled attentions and correlation (SalSAC) [36], saliency based exponential moving average (SalEMA) [37],
unified image and video saliency modeling (UNISAL) [38], and the spatial-temporal and self-attention encoding
network (STSANet) [26], which serves as the foundation for this solution model.
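
Three of the metrics named above (CC, NSS and SIM) can be sketched in a few lines using their commonly used definitions. This is an illustration rather than the evaluation code behind Table 1: maps are assumed to be small lists of rows, NSS uses the population standard deviation, and the function names are hypothetical.

```python
import math


def _flatten(m):
    return [v for row in m for v in row]


def cc(pred, gt):
    """Pearson's correlation coefficient between two saliency maps."""
    p, g = _flatten(pred), _flatten(gt)
    mp, mg = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mp) * (b - mg) for a, b in zip(p, g))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sg = math.sqrt(sum((b - mg) ** 2 for b in g))
    return cov / (sp * sg)


def nss(pred, fixations):
    """Mean of the standardized prediction at fixated pixels (fixations is a 0/1 map)."""
    p, f = _flatten(pred), _flatten(fixations)
    mean = sum(p) / len(p)
    std = math.sqrt(sum((a - mean) ** 2 for a in p) / len(p))
    vals = [(a - mean) / std for a, fx in zip(p, f) if fx]
    return sum(vals) / len(vals)


def sim(pred, gt):
    """Histogram intersection of the two maps after normalizing each to sum to 1."""
    p, g = _flatten(pred), _flatten(gt)
    sp, sg = sum(p), sum(g)
    return sum(min(a / sp, b / sg) for a, b in zip(p, g))


# Toy 2x2 example with a single fixation at the brightest pixel.
pred = [[0.0, 0.0], [0.0, 4.0]]
fix = [[0, 0], [0, 1]]
scores = (cc(pred, pred), nss(pred, fix), sim(pred, pred))
```

A map compared with itself gives CC and SIM of 1, which is a convenient sanity check before running the metrics on real predictions.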


4.2. Results 

The comparison among all the mentioned state-of-the-art methods is given in Table 1. As can be
seen from Table 1, the proposed solution outperforms almost all state-of-the-art methods on the evaluation
metrics. It achieves the best CC, NSS and AUC values; on SIM it is second only to SalEMA [37], while
ViNet [31] performs best on sAUC. On the remaining comparisons the performance is also good: the
Kullback-Leibler divergence values of the base reference STSANet [26] and the proposed system are 1.344 and
1.297 respectively.


Table 1. Comparison of all the values of the evaluation metrics mentioned for all the state-of-the-art methods
along with our proposed system (DHF1K)

Method            CC     NSS    SIM    AUC    sAUC
TSFP-Net [29]     0.517  2.966  0.392  0.912  0.723
HD2S [30]         0.503  2.812  0.406  0.908  0.700
ViNet [31]        0.511  2.872  0.381  0.908  0.729
DeepVS [32]       0.344  1.911  0.256  0.856  0.583
Chen et al. [33]  0.476  2.685  0.353  0.900  0.680
ACLNet [27]       0.434  2.354  0.315  0.890  0.601
STRA-Net [34]     0.458  2.558  0.355  0.895  0.663
TASED-Net [35]    0.470  2.667  0.361  0.895  0.712
SalSAC [36]       0.479  2.673  0.357  0.896  0.697
SalEMA [37]       0.449  2.574  0.466  0.890  0.667
UNISAL [38]       0.490  2.776  0.390  0.901  0.691
STSANet [26]      0.529  3.010  0.383  0.913  0.723
Proposed system   0.547  3.109  0.407  0.933  0.701


The lower Kullback-Leibler divergence shows that the proposed solution is also better in terms of
dissimilarity. As shown in Figures 1 and 2, the predictions of the proposed solution are more in line with the
ground truth than those of the other evaluated methods, which indicates that it is the most accurate of the
compared alternatives. Figure 1 shows the comparison of all existing models
with the proposed system. Figure 2 shows the comparison of the ground truths with the proposed system and other
state-of-the-art methods.
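
The Kullback-Leibler divergence quoted in the results can be sketched with the definition commonly used for saliency maps, where both maps are first normalized to sum to 1 and lower values mean a closer match. The small constant `eps` and the function name are assumptions for illustration.

```python
import math


def kl_divergence(pred, gt, eps=1e-12):
    """KL divergence of the ground-truth map from the predicted map (lower is better)."""
    p = [v for row in pred for v in row]
    g = [v for row in gt for v in row]
    sp, sg = sum(p), sum(g)
    return sum(
        (b / sg) * math.log(eps + (b / sg) / (a / sp + eps))
        for a, b in zip(p, g)
    )


# Toy 2x2 maps: a perfect prediction and a uniform one.
gt = [[1.0, 3.0], [0.0, 4.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
kl_perfect = kl_divergence(gt, gt)
kl_uniform = kl_divergence(uniform, gt)
```

A perfect prediction gives a divergence of essentially zero, while the uniform map is penalized for spreading its mass away from the fixated regions.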


[Bar chart: CC, NSS, SIM, AUC and sAUC on DHF1K for TSFP-Net [29], HD2S [30], ViNet [31], DeepVS [32],
Chen et al. [33], ACLNet [27], STRA-Net [34], TASED-Net [35], SalSAC [36], SalEMA [37], UNISAL [38],
STSANet [26] and the proposed system]


Figure 1. Comparison of methodologies 


Video saliency detection using modified high efficiency video coding ... (Sharada P. Narasimha) 




[Saliency map panels: ground truth, ACLNet, STRA-Net, TASED-Net, ViNet, STSANet, proposed solution]


Figure 2. Comparison of the ground truths with the proposed system and other state-of-the-art methods 


5. CONCLUSION 

In this paper, a modified HEVC technique with spatiotemporal saliency encoding and background
modelling was proposed. The solution rests on two strategies. The first is the use of the G-picture in the
fourth frame as a long-term reference frame. The second is the classification of coding blocks for background
segregation, which guides the quantization of each frame
as well as the quantization of the G-picture. This has led to a reduction in time consumption and coding
complexity along with an increase in efficiency and accuracy when the video is compressed. Although the
results show an improvement in almost every evaluation metric chosen for this paper, there is still
room for improvement. We hope that this solution will act as a stepping-stone for other researchers to
build on in their future solutions, bringing video saliency detection closer to the level of humans and their eye
and brain coordination.